Abstract

Clustering of variables is a way to arrange variables into homogeneous clusters, i.e.,
groups of variables which are strongly related to each other and thus bring the same
information. These approaches can then be useful for dimension reduction and variable
selection. Several specific methods have been developed for the clustering of numerical
variables. However, concerning qualitative variables or mixtures of quantitative and
qualitative variables, far fewer methods have been proposed. The R package ClustOfVar was
specifically developed for this purpose. The homogeneity criterion of a cluster is defined
as the sum of correlation ratios (for qualitative variables) and squared correlations (for
quantitative variables) to a synthetic quantitative variable, summarizing "as well as
possible" the variables in the cluster. This synthetic variable is the first principal component
obtained with the PCAMIX method. Two algorithms for the clustering of variables are
proposed: an iterative relocation algorithm and ascendant hierarchical clustering. We also
propose a bootstrap approach in order to determine suitable numbers of clusters. We
illustrate the methodologies and the associated package on small datasets.



Keywords: Dimension reduction, hierarchical clustering of variables, k-means clustering of 
variables, mixture of quantitative and qualitative variables, stability. 



1. Introduction 

Principal Component Analysis (PCA) and Multiple Correspondence Analysis (MCA) are 
appealing statistical tools for multivariate description of respectively numerical and categorical 
data. Rotated principal components fulfill the need to get more interpretable components. 
Clustering of variables is an alternative since it makes it possible to arrange variables into
homogeneous clusters and thus to obtain meaningful structures. From a general point of
view, variable clustering lumps together variables which are strongly related to each other
and thus bring the same information. Once the variables are clustered into groups such that
attributes in each group reflect the same aspect, the practitioner may be spurred on to select
one variable from each group. One may also want to construct a synthetic variable. For
instance, in the case of quantitative variables, a solution is to perform a PCA in each cluster
and to retain the first principal component as the synthetic variable of the cluster.

A simple and frequently used approach for clustering a set of variables is to calculate the 
dissimilarities between these variables and to apply a classical cluster analysis method to this 
dissimilarity matrix. We can cite the functions hclust of the R package stats (R Develop- 
ment Core Team 2011) and agnes of the package cluster (Maechler, Rousseeuw, Struyf, and 
Hubert 2005) which can be used for single, complete, average linkage hierarchical clustering. 
The functions diana and pam of the package cluster can also be used for respectively divisive 
hierarchical clustering and partitioning around medoids (Kaufman and Rousseeuw 1990). But 
the dissimilarity matrix has to be calculated first. For quantitative variables many
dissimilarity measures can be used: correlation coefficients (parametric or nonparametric) can be
converted to different dissimilarities depending on whether the aim is to lump together correlated
variables regardless of the sign of the correlation or whether a negative correlation coefficient
between two variables shows disagreement between them. For categorical variables, many
association measures can be used, such as \(\chi^2\), Rand, Belson, Jaccard, Sokal and Jordan among others.
Many strategies can then be applied and it can be difficult for the user to choose one of them.
Moreover, no synthetic variable of the clusters is directly provided with this approach.
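
As a minimal sketch of this classical approach, the following base R code converts a Pearson correlation matrix into two dissimilarities and passes each of them to hclust. The built-in mtcars data, the choice of six of its columns, and the two conversions 1 - r^2 and 1 - r are illustrative assumptions, not prescriptions of the paper:

```r
# Classical approach: cluster variables from a dissimilarity matrix.
# mtcars (shipped with R) is used purely as an illustrative dataset.
X <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]
R <- cor(X)                       # Pearson correlations between variables

# Sign ignored: strongly anti-correlated variables are considered close.
d_unsigned <- as.dist(1 - R^2)
# Sign kept: a negative correlation counts as disagreement.
d_signed <- as.dist(1 - R)

tree_unsigned <- hclust(d_unsigned, method = "average")
tree_signed <- hclust(d_signed, method = "average")

# Partitions of the variables into 2 clusters under each convention
part_unsigned <- cutree(tree_unsigned, k = 2)
part_signed <- cutree(tree_signed, k = 2)
print(part_unsigned)
print(part_signed)
```

The two conventions may lead to different partitions, which illustrates why the user has to choose a dissimilarity before any clustering can start.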

Besides these classical methods devoted to the clustering of observations, there exist methods
specifically devoted to the clustering of variables. The most famous one is the VARCLUS 
procedure of the SAS software. Recently specific methods based on PCA were proposed by 
Vigneau and Qannari (2003) with the name Clustering around Latent Variables (CLV) and 
by Dhillon, Marcotte, and Roshan (2003) with the name Diametrical Clustering. But all 
these specific approaches work only with quantitative data and as far as we know, they are 
not implemented in R. 

The aim of the package ClustOfVar is then to propose in R, methods specifically devoted to 
the clustering of variables with no restriction on the type (quantitative or qualitative) of the 
variables. The clustering methods developed in the package work with a mixture of quan- 
titative and qualitative variables and also work for a set exclusively containing quantitative 
(or qualitative) variables. In addition note that missing data are allowed: they are replaced 
by means for quantitative variables and by zeros in the indicator matrix for qualitative vari- 
ables. Two methods are proposed for the clustering of variables: a hierarchical clustering 
algorithm and a k-means type partitioning algorithm are respectively implemented in the 
functions hclustvar and kmeansvar. These two methods are based on PCAMIX, a principal 
component method for a mixture of qualitative and quantitative variables (Kiers 1991). This 
method includes the ordinary PCA and MCA as special cases. Here we use a Singular Value 
Decomposition (SVD) approach of PCAMIX (Chavent, Kuentz, and Saracco 2011). These 
two clustering algorithms aim at maximizing a homogeneity criterion. A cluster of variables
is defined as homogeneous when the variables in the cluster are strongly linked to a central
quantitative synthetic variable. This link is measured by the squared Pearson correlation for
the quantitative variables and by the correlation ratio for the qualitative variables. The
quantitative central synthetic variable of a cluster is the first principal component of PCAMIX
applied to all the variables in the cluster. Note that the synthetic variables of the clusters
can be used for dimension reduction or for recoding purposes. Moreover a method based on
a bootstrap approach is also proposed to evaluate the stability of the partitions of variables 
and can be used to determine a suitable number of clusters. It is implemented in the function 
stability. 

The rest of this paper is organized as follows. Section 2 contains a detailed description of the 
homogeneity criterion and a description of the PCAMIX procedure for the determination of 
the central synthetic variable. Section 3 describes the clustering algorithms and the bootstrap 
procedure. Section 4 provides two data-driven examples in order to illustrate the use of the 
functions and objects of the package ClustOfVar. Finally, Section 5 gives concluding remarks.



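2. The homogeneity criterion

Let \(\{x_1, \ldots, x_{p_1}\}\) be a set of \(p_1\) quantitative variables and \(\{z_1, \ldots, z_{p_2}\}\) a set of \(p_2\)
qualitative variables. Let X and Z be the corresponding quantitative and qualitative data matrices of
dimensions \(n \times p_1\) and \(n \times p_2\), where n is the number of observations. For the sake of simplicity, we
denote \(x_j \in \mathbb{R}^n\) the j-th column of X and we denote \(z_j \in \mathcal{M}_j^n\) the j-th column of
Z with \(\mathcal{M}_j\) the set of categories of \(z_j\). Let \(P_K = (C_1, \ldots, C_K)\) be a partition into K clusters
of the \(p = p_1 + p_2\) variables.

Synthetic variable of a cluster \(C_k\). It is defined as the quantitative variable \(y_k \in \mathbb{R}^n\)
the "most linked" to all the variables in \(C_k\):

\[ y_k = \arg\max_{u \in \mathbb{R}^n} \Big\{ \sum_{x_j \in C_k} r^2_{u,x_j} + \sum_{z_j \in C_k} \eta^2_{u|z_j} \Big\}, \]

where \(r^2\) denotes the squared Pearson correlation and \(\eta^2\) denotes the correlation ratio. More
precisely, the correlation ratio \(\eta^2_{u|z_j} \in [0, 1]\) measures the part of the variance of u explained
by the categories of \(z_j\):

\[ \eta^2_{u|z_j} = \frac{\sum_{s \in \mathcal{M}_j} n_s (\bar{u}_s - \bar{u})^2}{\sum_{i=1}^{n} (u_i - \bar{u})^2}, \]

where \(n_s\) is the frequency of category s, \(\bar{u}_s\) is the mean value of u calculated on the observations
belonging to category s and \(\bar{u}\) is the mean of u.

We have the following important results (Escofier (1979), Saporta (1990), Pagès (2004)):

• \(y_k\) is the first principal component of PCAMIX applied to \(X_k\) and \(Z_k\), the matrices
made up of the columns of X and Z corresponding to the variables in \(C_k\);

• the empirical variance of \(y_k\) is equal to: \(\mathrm{VAR}(y_k) = \sum_{x_j \in C_k} r^2_{x_j,y_k} + \sum_{z_j \in C_k} \eta^2_{y_k|z_j}\).

The determination of \(y_k\) using PCAMIX is carried out according to the following steps:
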
1. Recoding of \(X_k\) and \(Z_k\):

(a) \(\tilde{X}_k\) is the standardized version of the quantitative matrix \(X_k\),

(b) \(\tilde{Z}_k = J G D^{-1/2}\) is the standardized version of the indicator matrix G of the
qualitative matrix \(Z_k\), where D is the diagonal matrix of frequencies of the categories.
\(J = I - \mathbf{1}\mathbf{1}'/n\) is the centering operator where I denotes the identity matrix and \(\mathbf{1}\)
the vector with unit entries.

2. Concatenation of the two recoded matrices: \(M_k = (\tilde{X}_k | \tilde{Z}_k)\).

3. Singular Value Decomposition of \(M_k\): \(M_k = U \Lambda V'\).

4. Extraction/calculation of useful outputs:

• \(\sqrt{n}\, U \Lambda\) is the matrix of the principal component scores of PCAMIX;

• \(y_k\) is the first column of \(\sqrt{n}\, U \Lambda\);

• \(\mathrm{VAR}(y_k) = \lambda_1^k\) where \(\lambda_1^k\) is the first eigenvalue in \(\Lambda\).

Note that we recently developed an R package named PCAmixdata with a function PCAmix
which provides the principal components of PCAMIX and a function PCArot which provides
the principal components after rotation.
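
The recoding and SVD steps of this section can be sketched in base R as follows. This is only an illustration under stated assumptions, not the code of PCAmixdata or ClustOfVar: the helper names eta2 and pcamix_first are hypothetical, the standardized quantitative columns are additionally divided by the square root of n, and D is taken as the diagonal matrix of category counts, so that the squared first singular value equals the criterion. The built-in iris data serves as a small mixed dataset:

```r
# Correlation ratio of a numeric vector u given a qualitative variable z
eta2 <- function(u, z) {
  z <- droplevels(as.factor(z))
  n_s <- as.vector(table(z))                 # category frequencies
  m_s <- as.vector(tapply(u, z, mean))       # category means of u
  sum(n_s * (m_s - mean(u))^2) / sum((u - mean(u))^2)
}

# First PCAMIX component of a mixed cluster (hypothetical helper)
pcamix_first <- function(X, Z) {             # X: numeric, Z: data frame of factors
  X <- as.matrix(X)
  n <- nrow(X)
  # (a) standardize the quantitative columns (variance computed with 1/n),
  #     then divide by sqrt(n) (scaling assumption of this sketch)
  Xc <- sweep(X, 2, colMeans(X))
  Xs <- sweep(Xc, 2, sqrt(colMeans(Xc^2)), "/") / sqrt(n)
  # (b) centered indicator matrix, columns divided by sqrt(category counts)
  G <- do.call(cbind, lapply(Z, function(z) {
    z <- droplevels(as.factor(z))
    1 * outer(as.integer(z), seq_len(nlevels(z)), "==")
  }))
  Zs <- sweep(sweep(G, 2, colMeans(G)), 2, sqrt(colSums(G)), "/")
  sv <- svd(cbind(Xs, Zs))                   # concatenation, then SVD
  list(y = sqrt(n) * sv$u[, 1] * sv$d[1],    # synthetic variable (first scores)
       lambda = sv$d[1]^2)                   # its empirical variance
}

res <- pcamix_first(iris[, 1:4], iris["Species"])
# Criterion evaluated at y: squared correlations plus correlation ratio
criterion <- sum(cor(res$y, iris[, 1:4])^2) + eta2(res$y, iris$Species)
print(c(lambda = res$lambda, criterion = criterion))
```

Under these scaling assumptions the two printed values coincide, which mirrors the result that the empirical variance of the synthetic variable equals the sum of the squared correlations and correlation ratios.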



Homogeneity H of a cluster \(C_k\). It is a measure of adequacy between the variables in
the cluster and its central synthetic quantitative variable \(y_k\):

\[ H(C_k) = \sum_{x_j \in C_k} r^2_{x_j,y_k} + \sum_{z_j \in C_k} \eta^2_{y_k|z_j} = \lambda_1^k. \qquad (1) \]

Note that the first term (based on the squared Pearson correlation \(r^2\)) measures the link
between the quantitative variables in \(C_k\) and \(y_k\) independently of the sign of the relationship.
The second one (based on the correlation ratio \(\eta^2\)) measures the link between the qualitative
variables in \(C_k\) and \(y_k\). The homogeneity of a cluster is maximum when all the quantitative
variables are correlated (or anti-correlated) to \(y_k\) and when all the correlation ratios of the
qualitative variables are equal to 1. It means that all the variables in the cluster \(C_k\) bring the
same information.

Homogeneity \(\mathcal{H}\) of a partition \(P_K\). It is defined as the sum of the homogeneities of its
clusters:

\[ \mathcal{H}(P_K) = \sum_{k=1}^{K} H(C_k) = \lambda_1^1 + \ldots + \lambda_1^K, \qquad (2) \]

where \(\lambda_1^1, \ldots, \lambda_1^K\) are the first eigenvalues of PCAMIX applied to the K clusters \(C_k\) of \(P_K\).
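
For a cluster made only of quantitative variables, PCAMIX reduces to a PCA of the standardized variables, so the first eigenvalue is the largest eigenvalue of the correlation matrix of the cluster. The sketch below evaluates (1) and (2) in that special case. The function names, the mtcars columns and the partition P2 are hypothetical choices made for illustration:

```r
# Homogeneity (1) of a cluster of quantitative variables: largest eigenvalue
# of the correlation matrix of the variables in the cluster.
cluster_homogeneity <- function(X, vars) {
  eigen(cor(X[, vars, drop = FALSE]), symmetric = TRUE,
        only.values = TRUE)$values[1]
}

# Homogeneity (2) of a partition: sum over its clusters.
# 'partition' is a named vector giving the cluster label of each variable.
partition_homogeneity <- function(X, partition) {
  sum(sapply(split(names(partition), partition),
             function(vars) cluster_homogeneity(X, vars)))
}

X <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec", "drat")]
# Hypothetical partition into K = 2 clusters, chosen only for illustration
P2 <- c(mpg = 1, disp = 1, hp = 1, wt = 1, qsec = 2, drat = 2)
H2 <- partition_homogeneity(X, P2)
print(H2)
```

The value always lies between K (every cluster has a first eigenvalue of at least 1) and p (reached by the partition into p singletons).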



3. The clustering algorithms 

The aim is to find a partition of a set of quantitative and/or qualitative variables such that the
variables within a cluster are strongly related to each other. In other words the objective is
to find a partition \(P_K\) which maximizes the homogeneity function \(\mathcal{H}\) defined in (2). For
this, a hierarchical and a partitioning clustering algorithm are proposed in the package
ClustOfVar. A bootstrap procedure is also proposed to evaluate the stability of the partitions
into \(K = 2, 3, \ldots, p - 1\) clusters and then to help the user to determine a suitable number of
clusters of variables.

The hierarchical clustering algorithm. This algorithm builds a set of p nested partitions
of variables in the following way:

1. Step \(l = 0\): initialization. Start with the partition in p clusters.

2. Step \(l = 1, \ldots, p - 2\): aggregate two clusters of the partition in \(p - l + 1\) clusters to get
a new partition in \(p - l\) clusters. For this, choose clusters A and B with the smallest
dissimilarity d defined as:

\[ d(A, B) = H(A) + H(B) - H(A \cup B) = \lambda_1^A + \lambda_1^B - \lambda_1^{A \cup B}. \qquad (3) \]

This dissimilarity measures the loss of homogeneity observed when the two clusters A
and B are merged. Using this aggregation measure the new partition in \(p - l\) clusters
maximizes \(\mathcal{H}\) among all the partitions in \(p - l\) clusters obtained by aggregation of two
clusters of the partition in \(p - l + 1\) clusters.

3. Step \(l = p - 1\): stop. The partition in one cluster is obtained.

This algorithm is implemented in the function hclustvar which builds a hierarchy of the p
variables. The function plot.hclustvar gives the dendrogram of this hierarchy. The height
of a cluster \(C = A \cup B\) in this dendrogram is defined as \(h(C) = d(A, B)\). It is easy to
verify that \(h(C) \geq 0\) but the property "\(A \subset B \Rightarrow h(A) \leq h(B)\)" has not been proved yet.
Nevertheless, inversions in the dendrogram have never been observed in practice, either on
simulated data or on real data sets. Finally the function cutreevar cuts this dendrogram
and gives one of the p nested partitions according to the number K of clusters given in input
by the user.
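
The aggregation loop can be sketched as follows for quantitative variables only, where the homogeneity of a cluster is the largest eigenvalue of its correlation matrix. This is a naive illustration that recomputes every pairwise dissimilarity at each step; it is not the implementation of hclustvar, and the names H and hclust_vars_sketch as well as the mtcars columns are assumptions of the example:

```r
# Homogeneity of a cluster of quantitative variables (first eigenvalue)
H <- function(X, vars) {
  eigen(cor(X[, vars, drop = FALSE]), symmetric = TRUE,
        only.values = TRUE)$values[1]
}

hclust_vars_sketch <- function(X) {
  clusters <- as.list(colnames(X))           # start with p singleton clusters
  heights <- numeric(0)
  while (length(clusters) > 1) {
    pairs <- combn(length(clusters), 2)      # all candidate pairs (A, B)
    # loss of homogeneity d(A, B) = H(A) + H(B) - H(A u B) for every pair
    d <- apply(pairs, 2, function(ij) {
      A <- clusters[[ij[1]]]
      B <- clusters[[ij[2]]]
      H(X, A) + H(X, B) - H(X, c(A, B))
    })
    best <- pairs[, which.min(d)]            # pair with the smallest loss
    heights <- c(heights, min(d))
    merged <- c(clusters[[best[1]]], clusters[[best[2]]])
    clusters <- c(clusters[-best], list(merged))
  }
  heights                                    # aggregation levels, in merge order
}

X <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec", "drat")]
levels_sketch <- hclust_vars_sketch(X)
print(round(levels_sketch, 3))
```

The returned values play the role of the heights h(C) of the dendrogram; the loop performs p - 1 merges and ends with the partition in one cluster.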

The partitioning algorithm. This partitioning algorithm requires the definition of a
similarity measure between two variables of any type (quantitative or qualitative). We use for this
purpose the squared canonical correlation between two data matrices E and F of dimensions
\(n \times r_1\) and \(n \times r_2\). This correlation, denoted by s, can be easily calculated as follows:

\[ s = \begin{cases}
\text{first eigenvalue of the } n \times n \text{ matrix } EE'FF' & \text{if } \min(n, r_1, r_2) = n, \\
\text{first eigenvalue of the } r_1 \times r_1 \text{ matrix } E'FF'E & \text{if } \min(n, r_1, r_2) = r_1, \\
\text{first eigenvalue of the } r_2 \times r_2 \text{ matrix } F'EE'F & \text{if } \min(n, r_1, r_2) = r_2.
\end{cases} \]

This similarity s can also be defined as follows:

- For two quantitative variables \(x_i\) and \(x_j\), let \(E = \tilde{x}_i\) and \(F = \tilde{x}_j\) where \(\tilde{x}_i\) and \(\tilde{x}_j\) are
the standardized versions of \(x_i\) and \(x_j\). In this case, the squared canonical correlation
is the squared Pearson correlation: \(s(x_i, x_j) = r^2_{x_i,x_j}\).

- For one qualitative variable \(z_i\) and one quantitative variable \(x_j\), let \(E = \tilde{Z}_i\) and \(F = \tilde{x}_j\)
where \(\tilde{Z}_i\) is the standardized version of the indicator matrix \(G_i\) of the qualitative variable
\(z_i\). In this case, the squared canonical correlation is the correlation ratio: \(s(z_i, x_j) = \eta^2_{x_j|z_i}\).

- For two qualitative variables \(z_i\) and \(z_j\) having r and s categories, let \(E = \tilde{Z}_i\) and
\(F = \tilde{Z}_j\). In this case, the squared canonical correlation \(s(z_i, z_j)\) does not correspond to
a well known association measure. Its interpretation is geometrical: the closer to one
is \(s(z_i, z_j)\), the closer are the two linear subspaces spanned by the matrices E and F.
Then the two qualitative variables \(z_i\) and \(z_j\) bring similar information.

This similarity measure is implemented in the function mixedVarSim. 
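
The three cases can be checked numerically with a short sketch. Note the swapped-in technique: instead of the eigenvalue formula, and instead of mixedVarSim, the code below relies on the function cancor of the stats package applied to a dummy coding of the qualitative variables. The helper names recode and sim_sketch and the datasets iris and mtcars are assumptions of the example:

```r
# Recode a variable as a numeric matrix: a numeric vector is kept as is,
# a factor is dummy coded without its reference level (full column rank).
recode <- function(v) {
  if (is.numeric(v)) return(as.matrix(v))
  model.matrix(~ v)[, -1, drop = FALSE]
}

# Squared first canonical correlation between two variables of any type
sim_sketch <- function(a, b) cancor(recode(a), recode(b))$cor[1]^2

# quantitative / quantitative: squared Pearson correlation
s_qq <- sim_sketch(iris$Sepal.Length, iris$Petal.Length)
# qualitative / quantitative: correlation ratio
s_zq <- sim_sketch(iris$Species, iris$Sepal.Length)
# qualitative / qualitative (two mtcars columns turned into factors)
s_zz <- sim_sketch(factor(mtcars$cyl), factor(mtcars$gear))

print(c(quanti_quanti = s_qq, quali_quanti = s_zq, quali_quali = s_zz))
```

The second value coincides with the R-squared of the one-way analysis of variance of Sepal.Length on Species, which is another way of writing the correlation ratio.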

The clustering algorithm implemented in the function kmeansvar then builds a partition in
K clusters in the following way:

1. Initialization step: two possibilities are available. 

(a) A non random initialization: an initial partition in K clusters is given in input (for 
instance the partition obtained by cutting the dendrogram of the hierarchy) . 

(b) A random initialization: 

i. K variables are randomly selected among the p variables as initial central 
synthetic variables (named centers hereafter). 

ii. An initial partition into K clusters is built by allocating each variable to the 
cluster with the closest initial center: the similarity between a variable and an 
initial center is calculated using the function mixedVarSim. 

2. Repeat 

(a) A representation step: the quantitative central synthetic variable of each cluster 
Ck is calculated with PCAMIX as defined in section 2. 

(b) An allocation step: a partition is constructed by assigning each variable to the clos- 
est cluster. The similarity between a variable and the central synthetic quantitative 
variable of the corresponding cluster is calculated with the function mixedVarSim: 
it is either a squared correlation (if the variable is quantitative) or a correlation 
ratio (if the variable is qualitative). 

3. Stop if there are no more changes in the partition or if a maximum number of iterations
(fixed by the user) is reached.

This iterative procedure kmeansvar provides a partition \(P_K\) into K clusters which maximizes
\(\mathcal{H}\), but this optimum is local and depends on the initial partition. A solution to overcome this
problem and to avoid the influence of the choice of an arbitrary initial partition is to consider
multiple random initializations. In this case, steps 1(b), 2 and 3 are repeated, and we propose
to retain as final partition the one which provides the highest value of \(\mathcal{H}\).
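
The relocation scheme can be sketched as follows for quantitative variables only, with a random initialization as in step 1(b). This is an illustration under assumptions and not the code of kmeansvar: the function name kmeans_vars_sketch and its arguments are hypothetical, the synthetic variable of a cluster is obtained with prcomp on the standardized columns, and the similarity is the squared Pearson correlation:

```r
kmeans_vars_sketch <- function(X, K, iter.max = 50) {
  Xs <- scale(X)
  # step 1(b): K variables drawn at random serve as initial centers
  centers <- Xs[, sample(ncol(Xs), K), drop = FALSE]
  part <- apply(cor(Xs, centers)^2, 1, which.max)
  for (it in seq_len(iter.max)) {
    labels <- sort(unique(part))
    # representation step: first principal component of each cluster
    centers <- sapply(labels, function(k) {
      prcomp(Xs[, part == k, drop = FALSE])$x[, 1]
    })
    # allocation step: each variable goes to the most correlated center
    new_part <- labels[apply(cor(Xs, centers)^2, 1, which.max)]
    if (all(new_part == part)) break         # no more changes: stop
    part <- new_part
  }
  # homogeneity of the final partition: sum of the first eigenvalues
  H <- sum(sapply(sort(unique(part)), function(k) {
    eigen(cor(Xs[, part == k, drop = FALSE]), symmetric = TRUE,
          only.values = TRUE)$values[1]
  }))
  list(cluster = setNames(as.vector(part), colnames(Xs)), homogeneity = H)
}

set.seed(1)
X <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec", "drat")]
# Several random starts; keep the partition with the highest homogeneity
runs <- lapply(1:5, function(i) kmeans_vars_sketch(X, K = 2))
best <- runs[[which.max(sapply(runs, function(r) r$homogeneity))]]
print(best$cluster)
print(best$homogeneity)
```

Keeping the best of several runs corresponds to the multiple random initializations described above.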

Stability of partitions of variables. This procedure evaluates the stability of the p nested
partitions of the dendrogram obtained with hclustvar. It works as follows:

1. B bootstrap samples of the n observations are drawn and the corresponding B
dendrograms are obtained with the function hclustvar.

2. The partitions of these B dendrograms are compared with the partitions of the initial
hierarchy using the adjusted Rand index. The Rand and the adjusted Rand indices are
implemented in the function Rand (see Hubert and Arabie (1985) for details on these
indices).

3. The stability of a partition is evaluated by the mean of the B adjusted Rand indices. 

The plot of this stability criterion according to the number of clusters can help the user in the
choice of a sensible and suitable number of clusters. Note that an error message may appear
with this function when a qualitative variable has rare categories. Indeed, if such a rare
category disappears in a bootstrap sample of observations, a column of identical values is
then formed and the standardization of this variable is not possible in the PCAMIX step.
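
The bootstrap procedure can be sketched in base R as follows. Two assumptions should be kept in mind: the hierarchy of variables is a stand-in built with hclust on the dissimilarity 1 - r^2 (not hclustvar), and the adjusted Rand index is recomputed from its usual pair-counting definition rather than with the function Rand of the package. The names adj_rand, var_tree and stability_sketch are hypothetical:

```r
# Adjusted Rand index between two partitions given as label vectors
adj_rand <- function(p1, p2) {
  tab <- table(p1, p2)
  n_pairs <- function(x) sum(choose(x, 2))
  expected <- n_pairs(rowSums(tab)) * n_pairs(colSums(tab)) /
    choose(sum(tab), 2)
  (n_pairs(tab) - expected) /
    ((n_pairs(rowSums(tab)) + n_pairs(colSums(tab))) / 2 - expected)
}

# Stand-in hierarchy of variables (assumption: hclust on 1 - r^2)
var_tree <- function(X) hclust(as.dist(1 - cor(X)^2), method = "average")

stability_sketch <- function(X, B = 20, K = 2:(ncol(X) - 1)) {
  ref <- var_tree(X)                                   # initial hierarchy
  crit <- replicate(B, {
    Xb <- X[sample(nrow(X), replace = TRUE), ]         # bootstrap observations
    boot <- var_tree(Xb)
    sapply(K, function(k) adj_rand(cutree(ref, k), cutree(boot, k)))
  })
  setNames(rowMeans(crit), paste0("K=", K))            # mean index for each K
}

set.seed(123)
X <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec", "drat")]
stab_sketch <- stability_sketch(X)
print(round(stab_sketch, 2))
```

A number of clusters with a mean index close to 1 corresponds to a partition that is recovered in most bootstrap samples.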

4. Illustration on simple examples 

We illustrate our R package ClustOfVar on two real datasets: the first one only concerns 
quantitative variables, the second one is a mixture of quantitative and qualitative variables. 

4.1. First example: Quantitative data 

We use the dataset decathlon which contains n = 41 athletes described according to their 
performances in p = 10 different sports of decathlon. 

R> library("ClustOfVar")
R> data("decathlon")
R> head(decathlon[, 1:4])



         100m Long.jump Shot.put High.jump
SEBRLE  11.04      7.58    14.83      2.07
CLAY    10.76      7.40    14.26      1.86
KARPOV  11.02      7.30    14.77      2.04
BERNARD 11.02      7.23    14.25      1.92
YURKOV  11.34      7.09    15.19      2.10
WARNERS 11.11      7.60    14.31      1.98



In order to have an idea of the links between these 10 quantitative variables, we will construct 
a hierarchy with the function hclustvar. 

R> X.quanti <- decathlon[, 1:10]
R> tree <- hclustvar(X.quanti)
R> plot(tree)

[Figure 1 panels: "Aggregation levels" (horizontal axis: number of clusters) and "Cluster Dendrogram".]

Figure 1: Graphical output of the function plot.hclustvar.

In Figure 1, the plot of the aggregation levels suggests choosing 3 or 5 clusters of variables.
The dendrogram, on the right hand side of this figure, shows the link between the variables in
terms of \(r^2\). For instance, the two variables "discus" and "shot put" are linked as well as the
two variables "Long.jump" and "400m", but the user must keep in mind that the dendrogram
does not indicate the sign of these relationships: a careful study of these variables shows that
"discus" and "shot put" are correlated whereas "Long.jump" and "400m" are anti-correlated.

The user can use the stability function in order to have an idea of the stability of the 
partitions of the dendrogram represented in Figure 1. 

R> stab <- stability(tree, B = 40)
R> plot(stab, main = "Stability of the partitions")
R> stab$matCR
R> boxplot(stab$matCR, main = "Dispersion of the adjusted Rand index")

On the left of Figure 2, the plot of the mean (over the B = 40 bootstrap samples) of the
adjusted Rand indices is obtained with the function plot.clustab. It clearly suggests
choosing 5 clusters. The boxplots on the right of Figure 2 show the dispersion of these indices
over the B = 40 bootstrap replications for each partition, and they suggest 3 or 5 clusters.

In the following we choose K = 3 clusters because PCA applied to each of the 3 clusters gives
each time only one eigenvalue greater than 1. The function cutreevar cuts the dendrogram of
the hierarchy and gives a partition into K = 3 clusters of the p = 10 variables:

R> P3 <- cutreevar(tree, 3)
R> cluster <- P3$cluster
R> princomp(X.quanti[, which(cluster == 1)], cor = TRUE)$sdev^2
R> princomp(X.quanti[, which(cluster == 2)], cor = TRUE)$sdev^2
R> princomp(X.quanti[, which(cluster == 3)], cor = TRUE)$sdev^2



The partition P3 is contained in an object of class clustvar. Note that partitions obtained
with the kmeansvar function are also objects of class clustvar. The function print.clustvar
gives a description of the values of this object.

[Figure 2 panels: "Stability of the partitions" (left) and "Dispersion of the adjusted Rand index" (right).]

Figure 2: Graphical output of the functions stability and plot.clustab.

R> P3 <- cutreevar(tree, 3, matsim = TRUE)
R> print(P3)

Call:
cutreevar(obj = tree, k = 3)

 name       description
 "$var"     "list of variables in each cluster"
 "$sim"     "similarity matrix in each cluster"
 "$cluster" "cluster memberships"
 "$wss"     "within-cluster sum of squares"
 "$E"       "gain in cohesion (in %)"
 "$size"    "size of each cluster"
 "$scores"  "score of each cluster"

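The value $wss is \(\mathcal{H}(P_K)\) where the homogeneity function \(\mathcal{H}\) was defined in (2). The gain in
cohesion $E is the percentage of homogeneity which is accounted for by the partition \(P_K\). It is
defined by:

\[ E(P_K) = \frac{\mathcal{H}(P_K) - \mathcal{H}(P_1)}{p - \mathcal{H}(P_1)}. \qquad (4) \]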

The value $sim provides the similarity matrices of the variables in each cluster (calculated
with the function mixedVarSim). Note that computing these similarity matrices is time
consuming when the number of variables is large. Thus they are not calculated by default:
matsim=TRUE must be specified in the parameters of the function cutreevar (or kmeansvar)
if the user wants this output. We provide below the similarity matrix for the first cluster of
this partition into 3 clusters.

R> round(P3$sim$cluster1, digits = 2)

            100m Long.jump 400m 110m.hurdle
100m        1.00      0.36 0.27        0.34
Long.jump   0.36      1.00 0.36        0.26
400m        0.27      0.36 1.00        0.30
110m.hurdle 0.34      0.26 0.30        1.00



The value $cluster is a vector of integers indicating the cluster to which each variable is 
allocated. 

R> P3$cluster

       100m   Long.jump    Shot.put   High.jump        400m 110m.hurdle
          1           1           2           2           1           1
     Discus  Pole.vault    Javeline       1500m
          2           3           2           3

The value $var gives a description of each cluster of the partition. More precisely it provides
for each cluster the squared loadings on the first principal component of PCAMIX (which is the
central synthetic variable of this cluster). For quantitative variables (resp. qualitative), the
squared loadings are squared correlations (resp. correlation ratios) with this central synthetic
variable. For instance the squared correlation between the variable "100m" and the central
synthetic variable of "cluster1" is 0.68.

R> P3$var

$cluster1
            squared loading
100m              0.6822349
Long.jump         0.6873076
400m              0.6652279
110m.hurdle       0.6427661

$cluster2
          squared loading
Shot.put        0.7861012
High.jump       0.4991778
Discus          0.6023186
Javeline        0.2546550

$cluster3
           squared loading
Pole.vault       0.6237239
1500m            0.6237239


The value $scores is the n x K matrix of the scores of the n observations on the first principal
components of PCAMIX applied to the K clusters: PCAMIX is applied 3 times here, once
in each cluster. Each column is then a synthetic variable of a cluster. The central synthetic
variable of "cluster1" for instance is the first column of the 41 x 3 matrix below. This column
gives the scores of the 41 athletes on the first component of PCAMIX applied to the variables
of "cluster1" (100m, Long.jump, 400m, 110m.hurdle).

R> head(P3$scores)

          cluster1   cluster2   cluster3
SEBRLE   0.2640687 -1.0353928 -1.4405915
CLAY     1.3816943 -0.3454687 -1.7840860
KARPOV   1.1098485 -0.7209119 -1.7043603
BERNARD -0.1949061  0.7082857 -1.5017373
YURKOV  -2.0319539 -1.8850107  0.2702640
WARNERS  1.1385110  1.0929346 -0.3490226



Note that this 41 x 3 matrix of the scores of the 41 athletes in each cluster of variables is of 
course different from the 41 x 3 matrix of the scores of the athletes on the first 3 principal 
components of PCAMIX (here PCA) applied to the initial dataset. The 3 synthetic variables 
for instance can be correlated whereas the first 3 principal components of PCAMIX are not 
correlated by construction. But the matrix of the synthetic variables in $scores can be used
in the same way as the matrix of the principal components of PCAMIX for dimension reduction purposes.
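
The difference between the two kinds of scores can be illustrated with a short base R sketch. The mtcars columns and the partition of the variables into 2 clusters are hypothetical choices for illustration, and prcomp on standardized variables stands in for PCAMIX, which is legitimate here only because all variables are quantitative:

```r
X <- scale(mtcars[, c("mpg", "disp", "hp", "wt", "qsec", "drat")])
# Hypothetical partition of the 6 variables into 2 clusters (illustration only)
part <- c(1, 1, 1, 1, 2, 2)

# One synthetic variable per cluster: first principal component within it
synthetic <- sapply(1:2, function(k) prcomp(X[, part == k])$x[, 1])
# First two principal components of the whole dataset
global <- prcomp(X)$x[, 1:2]

print(round(cor(synthetic), 3))   # off-diagonal entries are generally non-zero
print(round(cor(global), 3))      # off-diagonal entries are zero by construction
```

Both matrices have one row per observation and can be passed to later analyses, but only the second one has uncorrelated columns.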



4.2. Second example: A mixture of quantitative and qualitative data 

We use the dataset wine which contains n = 21 French wines described by p = 31 variables.
The first two variables "Label" and "Soil" are qualitative with respectively 3 and 4 categories. 
The other 29 variables are quantitative. 

R> data("wine")
R> head(wine[, c(1:4)])

          Label      Soil Odor.Intensity Aroma.quality
2EL      Saumur      Env1          3.074         3.000
1CHA     Saumur      Env1          2.964         2.821
1FDN Bourgueuil      Env1          2.857         2.929
1VAU     Chinon      Env2          2.808         2.593
1DAM     Saumur Reference          3.607         3.429
2BOU Bourgueuil Reference          2.857         3.111



In order to have an idea of the links between these 31 quantitative and qualitative variables, 
we construct a hierarchy using the function hclustvar. 

R> X.quanti <- wine[, c(3:29)]
R> X.quali <- wine[, c(1, 2)]
R> tree <- hclustvar(X.quanti, X.quali)
R> plot(tree)

In Figure 3, we plot the dendrogram. It shows for instance that the qualitative variable "Label"
is linked (in terms of correlation ratio) with the quantitative variable "Phenolic". Based on
this dendrogram, the user chooses to cut it into K = 6 clusters:

R> part_hier <- cutreevar(tree, 6)
R> part_hier$var$"cluster1"

                     squared loading
Odor.Intensity             0.7617528
Spice.before.shaking       0.6160243
Odor.Intensity.1           0.6663325
Spice                      0.5357837
Bitterness                 0.6620632
Soil                       0.7768805

Figure 3: Dendrogram of the hierarchy of the 31 variables of the wine dataset.



A close reading of the output for "cluster1" shows that the correlation ratio between the
qualitative variable "Soil" and the synthetic variable of the cluster is about 0.78. The squared
correlation between "Odor.Intensity" and the synthetic variable of the cluster is 0.76.

The central synthetic variables of the 6 clusters are in part_hier$scores. This 21 x 6
quantitative matrix can replace the original 21 x 31 data matrix mixing qualitative and quantitative
variables. This matrix of the synthetic variables can then be used for recoding a mixed data 
matrix (or a qualitative data matrix) into a quantitative data matrix, as is usually done with 
the matrix of the principal components of PCAMIX. 

The function kmeansvar can also provide a partition into K = 6 clusters of the 31 variables. 

R> part_km <- kmeansvar(X.quanti, X.quali, init = 6, nstart = 10)

The gain in cohesion (4) of the partition obtained with the k-means type partitioning
algorithm and 10 random initializations is smaller than that of the partition obtained with
the hierarchical clustering algorithm (51.02 versus 56.84):

R> part_km$E 

[1] 51.02414 
R> part_hier$E 

[1] 56.84082 

In practice, simulations and real datasets showed that the quality of the partitions obtained
with hclustvar seems to be better than that obtained with kmeansvar. But for large datasets
(with a large number of variables), the computation time of the function hclustvar becomes
a problem. In this case, the function kmeansvar will be faster.

5. Concluding remarks 

The R package ClustOfVar proposes hierarchical and k-means type algorithms for the clus- 
tering of variables of any type (quantitative and/or qualitative). 

This package proposes useful tools to visualize the links between the variables and the re- 
dundancy in a data set. It is also an alternative to principal component analysis methods 
for dimension reduction and for recoding qualitative or mixed data matrices into a quantitative
data matrix. The main difference between PCA and the approach of clustering of variables
presented in this paper is that the synthetic variables of the clusters can be correlated whereas
the principal components are not correlated by construction.

The package ClustOfVar does not perform well with datasets having a very large number of
variables: the computational time becomes relatively long. Future work will propose a new
version of the package with versions of the functions hclustvar, kmeansvar and stability
developed for parallel computing.

We mention that the package ClustOfVar can deal with missing data. However let us note 
that the imputation method used in the code is simple and may not perform well when the 
proportion of missing data is too large. In that case, one of the numerous R packages devoted 
to missing data imputation should be used prior to ClustOfVar. 